{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "markdown"
    }
   },
   "source": [
    "# Proposition Chunking for Enhanced RAG\n",
    "\n",
     "In this notebook, we implement proposition chunking, an advanced technique that breaks documents down into atomic, factual statements for more accurate retrieval. Unlike traditional chunking, which simply divides text by character count, proposition chunking preserves the semantic integrity of individual facts.\n",
    "\n",
    "Proposition chunking delivers more precise retrieval by:\n",
    "\n",
    "1. Breaking content into atomic, self-contained facts\n",
    "2. Creating smaller, more granular units for retrieval  \n",
    "3. Enabling more precise matching between queries and relevant content\n",
    "4. Filtering out low-quality or incomplete propositions\n",
    "\n",
    "Let's build a complete implementation without relying on LangChain or FAISS."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting Up the Environment\n",
    "We begin by importing necessary libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import numpy as np\n",
    "import json\n",
    "import fitz\n",
    "from openai import OpenAI\n",
    "import re"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extracting Text from a PDF File\n",
    "To implement RAG, we first need a source of textual data. In this case, we extract text from a PDF file using the PyMuPDF library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_text_from_pdf(pdf_path):\n",
    "    \"\"\"\n",
     "    Extract text from all pages of a PDF file.\n",
    "\n",
    "    Args:\n",
    "    pdf_path (str): Path to the PDF file.\n",
    "\n",
    "    Returns:\n",
    "    str: Extracted text from the PDF.\n",
    "    \"\"\"\n",
    "    # Open the PDF file\n",
    "    mypdf = fitz.open(pdf_path)\n",
    "    all_text = \"\"  # Initialize an empty string to store the extracted text\n",
    "\n",
    "    # Iterate through each page in the PDF\n",
    "    for page_num in range(mypdf.page_count):\n",
    "        page = mypdf[page_num]  # Get the page\n",
    "        text = page.get_text(\"text\")  # Extract text from the page\n",
    "        all_text += text  # Append the extracted text to the all_text string\n",
    "\n",
    "    return all_text  # Return the extracted text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chunking the Extracted Text\n",
    "Once we have the extracted text, we divide it into smaller, overlapping chunks to improve retrieval accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def chunk_text(text, chunk_size=800, overlap=100):\n",
    "    \"\"\"\n",
    "    Split text into overlapping chunks.\n",
    "    \n",
    "    Args:\n",
    "        text (str): Input text to chunk\n",
    "        chunk_size (int): Size of each chunk in characters\n",
    "        overlap (int): Overlap between chunks in characters\n",
    "        \n",
    "    Returns:\n",
    "        List[Dict]: List of chunk dictionaries with text and metadata\n",
    "    \"\"\"\n",
     "    if overlap >= chunk_size:\n",
     "        # A non-positive step would make range() below raise or loop incorrectly\n",
     "        raise ValueError(\"overlap must be smaller than chunk_size\")\n",
     "\n",
     "    chunks = []  # Initialize an empty list to store the chunks\n",
     "    \n",
     "    # Slide a window over the text, advancing by (chunk_size - overlap) characters\n",
     "    for i in range(0, len(text), chunk_size - overlap):\n",
    "        chunk = text[i:i + chunk_size]  # Extract a chunk of the specified size\n",
    "        if chunk:  # Ensure we don't add empty chunks\n",
    "            chunks.append({\n",
    "                \"text\": chunk,  # The chunk text\n",
    "                \"chunk_id\": len(chunks) + 1,  # Unique ID for the chunk\n",
    "                \"start_char\": i,  # Starting character index of the chunk\n",
    "                \"end_char\": i + len(chunk)  # Ending character index of the chunk\n",
    "            })\n",
    "    \n",
    "    print(f\"Created {len(chunks)} text chunks\")  # Print the number of created chunks\n",
    "    return chunks  # Return the list of chunks"
   ]
  },
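  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check (illustrative only, not part of the pipeline), we can run `chunk_text` on a short synthetic string; with `chunk_size=10` and `overlap=3`, the window advances 7 characters at a time, so the last 3 characters of each chunk reappear at the start of the next."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sanity check on a tiny synthetic string (illustrative only)\n",
    "sample = \"abcdefghijklmnopqrstuvwxyz\"\n",
    "demo_chunks = chunk_text(sample, chunk_size=10, overlap=3)\n",
    "for c in demo_chunks:\n",
    "    # Each chunk records its ID and character span in the source text\n",
    "    print(c[\"chunk_id\"], c[\"start_char\"], c[\"end_char\"], repr(c[\"text\"]))"
   ]
  },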
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting Up the OpenAI API Client\n",
    "We initialize the OpenAI client to generate embeddings and responses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize the OpenAI client with the base URL and API key\n",
    "client = OpenAI(\n",
    "    base_url=\"https://api.studio.nebius.com/v1/\",\n",
    "    api_key=os.getenv(\"OPENAI_API_KEY\")  # Retrieve the API key from environment variables\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Simple Vector Store Implementation\n",
    "We'll create a basic vector store to manage document chunks and their embeddings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "class SimpleVectorStore:\n",
    "    \"\"\"\n",
    "    A simple vector store implementation using NumPy.\n",
    "    \"\"\"\n",
    "    def __init__(self):\n",
    "        # Initialize lists to store vectors, texts, and metadata\n",
    "        self.vectors = []\n",
    "        self.texts = []\n",
    "        self.metadata = []\n",
    "    \n",
    "    def add_item(self, text, embedding, metadata=None):\n",
    "        \"\"\"\n",
    "        Add an item to the vector store.\n",
    "        \n",
    "        Args:\n",
    "            text (str): The text content\n",
    "            embedding (List[float]): The embedding vector\n",
    "            metadata (Dict, optional): Additional metadata\n",
    "        \"\"\"\n",
    "        # Append the embedding, text, and metadata to their respective lists\n",
    "        self.vectors.append(np.array(embedding))\n",
    "        self.texts.append(text)\n",
    "        self.metadata.append(metadata or {})\n",
    "    \n",
    "    def add_items(self, texts, embeddings, metadata_list=None):\n",
    "        \"\"\"\n",
    "        Add multiple items to the vector store.\n",
    "        \n",
    "        Args:\n",
    "            texts (List[str]): List of text contents\n",
    "            embeddings (List[List[float]]): List of embedding vectors\n",
    "            metadata_list (List[Dict], optional): List of metadata dictionaries\n",
    "        \"\"\"\n",
    "        # If no metadata list is provided, create an empty dictionary for each text\n",
    "        if metadata_list is None:\n",
    "            metadata_list = [{} for _ in range(len(texts))]\n",
    "        \n",
    "        # Add each text, embedding, and metadata to the store\n",
    "        for text, embedding, metadata in zip(texts, embeddings, metadata_list):\n",
    "            self.add_item(text, embedding, metadata)\n",
    "    \n",
    "    def similarity_search(self, query_embedding, k=5):\n",
    "        \"\"\"\n",
    "        Find the most similar items to a query embedding.\n",
    "        \n",
    "        Args:\n",
    "            query_embedding (List[float]): Query embedding vector\n",
    "            k (int): Number of results to return\n",
    "            \n",
    "        Returns:\n",
    "            List[Dict]: Top k most similar items\n",
    "        \"\"\"\n",
    "        # Return an empty list if there are no vectors in the store\n",
    "        if not self.vectors:\n",
    "            return []\n",
    "        \n",
    "        # Convert query embedding to a numpy array\n",
    "        query_vector = np.array(query_embedding)\n",
    "        \n",
    "        # Calculate similarities using cosine similarity\n",
    "        similarities = []\n",
    "        for i, vector in enumerate(self.vectors):\n",
    "            similarity = np.dot(query_vector, vector) / (np.linalg.norm(query_vector) * np.linalg.norm(vector))\n",
    "            similarities.append((i, similarity))\n",
    "        \n",
    "        # Sort by similarity in descending order\n",
    "        similarities.sort(key=lambda x: x[1], reverse=True)\n",
    "        \n",
    "        # Collect the top k results\n",
    "        results = []\n",
    "        for i in range(min(k, len(similarities))):\n",
    "            idx, score = similarities[i]\n",
    "            results.append({\n",
    "                \"text\": self.texts[idx],\n",
    "                \"metadata\": self.metadata[idx],\n",
    "                \"similarity\": float(score)  # Convert to float for JSON serialization\n",
    "            })\n",
    "        \n",
    "        return results"
   ]
  },
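  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before plugging in real embeddings, a toy example (illustrative only) can verify the cosine-similarity ranking: with unit vectors along the axes, a query along the x-axis should rank the x-axis item first with similarity 1.0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy check of similarity_search with 3-d vectors (illustrative only)\n",
    "toy_store = SimpleVectorStore()\n",
    "toy_store.add_item(\"x-axis fact\", [1.0, 0.0, 0.0], {\"id\": 1})\n",
    "toy_store.add_item(\"y-axis fact\", [0.0, 1.0, 0.0], {\"id\": 2})\n",
    "toy_store.add_item(\"diagonal fact\", [1.0, 1.0, 0.0], {\"id\": 3})\n",
    "\n",
    "top = toy_store.similarity_search([1.0, 0.0, 0.0], k=2)\n",
    "# Expect \"x-axis fact\" first (similarity 1.0), then \"diagonal fact\" (~0.707)\n",
    "for result in top:\n",
    "    print(result[\"text\"], round(result[\"similarity\"], 4))"
   ]
  },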
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating Embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_embeddings(texts, model=\"BAAI/bge-en-icl\"):\n",
    "    \"\"\"\n",
    "    Create embeddings for the given texts.\n",
    "    \n",
    "    Args:\n",
    "        texts (str or List[str]): Input text(s)\n",
    "        model (str): Embedding model name\n",
    "        \n",
    "    Returns:\n",
     "        List[float] or List[List[float]]: One embedding for a single string input, or a list of embeddings for a list input\n",
    "    \"\"\"\n",
    "    # Handle both string and list inputs\n",
    "    input_texts = texts if isinstance(texts, list) else [texts]\n",
    "    \n",
    "    # Process in batches if needed (OpenAI API limits)\n",
    "    batch_size = 100\n",
    "    all_embeddings = []\n",
    "    \n",
    "    # Iterate over the input texts in batches\n",
    "    for i in range(0, len(input_texts), batch_size):\n",
    "        batch = input_texts[i:i + batch_size]  # Get the current batch of texts\n",
    "        \n",
    "        # Create embeddings for the current batch\n",
    "        response = client.embeddings.create(\n",
    "            model=model,\n",
    "            input=batch\n",
    "        )\n",
    "        \n",
    "        # Extract embeddings from the response\n",
    "        batch_embeddings = [item.embedding for item in response.data]\n",
    "        all_embeddings.extend(batch_embeddings)  # Add the batch embeddings to the list\n",
    "    \n",
    "    # If input was a single string, return just the first embedding\n",
    "    if isinstance(texts, str):\n",
    "        return all_embeddings[0]\n",
    "    \n",
    "    # Otherwise, return all embeddings\n",
    "    return all_embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Proposition Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_propositions(chunk):\n",
    "    \"\"\"\n",
    "    Generate atomic, self-contained propositions from a text chunk.\n",
    "    \n",
    "    Args:\n",
    "        chunk (Dict): Text chunk with content and metadata\n",
    "        \n",
    "    Returns:\n",
    "        List[str]: List of generated propositions\n",
    "    \"\"\"\n",
    "    # System prompt to instruct the AI on how to generate propositions\n",
    "    system_prompt = \"\"\"Please break down the following text into simple, self-contained propositions. \n",
    "    Ensure that each proposition meets the following criteria:\n",
    "\n",
    "    1. Express a Single Fact: Each proposition should state one specific fact or claim.\n",
    "    2. Be Understandable Without Context: The proposition should be self-contained, meaning it can be understood without needing additional context.\n",
    "    3. Use Full Names, Not Pronouns: Avoid pronouns or ambiguous references; use full entity names.\n",
    "    4. Include Relevant Dates/Qualifiers: If applicable, include necessary dates, times, and qualifiers to make the fact precise.\n",
    "    5. Contain One Subject-Predicate Relationship: Focus on a single subject and its corresponding action or attribute, without conjunctions or multiple clauses.\n",
    "\n",
    "    Output ONLY the list of propositions without any additional text or explanations.\"\"\"\n",
    "\n",
    "    # User prompt containing the text chunk to be converted into propositions\n",
    "    user_prompt = f\"Text to convert into propositions:\\n\\n{chunk['text']}\"\n",
    "    \n",
    "    # Generate response from the model\n",
    "    response = client.chat.completions.create(\n",
     "        model=\"meta-llama/Llama-3.2-3B-Instruct\",  # Model used to rewrite chunks as atomic propositions\n",
    "        messages=[\n",
    "            {\"role\": \"system\", \"content\": system_prompt},\n",
    "            {\"role\": \"user\", \"content\": user_prompt}\n",
    "        ],\n",
    "        temperature=0\n",
    "    )\n",
    "    \n",
    "    # Extract propositions from the response\n",
    "    raw_propositions = response.choices[0].message.content.strip().split('\\n')\n",
    "    \n",
    "    # Clean up propositions (remove numbering, bullets, etc.)\n",
    "    clean_propositions = []\n",
    "    for prop in raw_propositions:\n",
    "        # Remove numbering (1., 2., etc.) and bullet points\n",
    "        cleaned = re.sub(r'^\\s*(\\d+\\.|\\-|\\*)\\s*', '', prop).strip()\n",
    "        if cleaned and len(cleaned) > 10:  # Simple filter for empty or very short propositions\n",
    "            clean_propositions.append(cleaned)\n",
    "    \n",
    "    return clean_propositions"
   ]
  },
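  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The numbering/bullet cleanup inside `generate_propositions` can be checked in isolation on a few hand-written lines (illustrative only; no API call needed). Lines that are empty, or 10 characters or fewer after cleanup, are dropped by the length filter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check the cleanup regex on sample model-output lines (illustrative only)\n",
    "samples = [\n",
    "    \"1. Water boils at 100 degrees Celsius at sea level.\",\n",
    "    \"- Paris is the capital of France.\",\n",
    "    \"* Short\",  # Too short: filtered out\n",
    "    \"\",         # Empty: filtered out\n",
    "]\n",
    "for line in samples:\n",
    "    cleaned = re.sub(r'^\\s*(\\d+\\.|\\-|\\*)\\s*', '', line).strip()\n",
    "    if cleaned and len(cleaned) > 10:\n",
    "        print(cleaned)"
   ]
  },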
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Quality Checking for Propositions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def evaluate_proposition(proposition, original_text):\n",
    "    \"\"\"\n",
    "    Evaluate a proposition's quality based on accuracy, clarity, completeness, and conciseness.\n",
    "    \n",
    "    Args:\n",
    "        proposition (str): The proposition to evaluate\n",
    "        original_text (str): The original text for comparison\n",
    "        \n",
    "    Returns:\n",
    "        Dict: Scores for each evaluation dimension\n",
    "    \"\"\"\n",
    "    # System prompt to instruct the AI on how to evaluate the proposition\n",
    "    system_prompt = \"\"\"You are an expert at evaluating the quality of propositions extracted from text.\n",
    "    Rate the given proposition on the following criteria (scale 1-10):\n",
    "\n",
    "    - Accuracy: How well the proposition reflects information in the original text\n",
    "    - Clarity: How easy it is to understand the proposition without additional context\n",
    "    - Completeness: Whether the proposition includes necessary details (dates, qualifiers, etc.)\n",
    "    - Conciseness: Whether the proposition is concise without losing important information\n",
    "\n",
    "    The response must be in valid JSON format with numerical scores for each criterion:\n",
    "    {\"accuracy\": X, \"clarity\": X, \"completeness\": X, \"conciseness\": X}\n",
    "    \"\"\"\n",
    "\n",
    "    # User prompt containing the proposition and the original text\n",
    "    user_prompt = f\"\"\"Proposition: {proposition}\n",
    "\n",
    "    Original Text: {original_text}\n",
    "\n",
    "    Please provide your evaluation scores in JSON format.\"\"\"\n",
    "\n",
    "    # Generate response from the model\n",
    "    response = client.chat.completions.create(\n",
    "        model=\"meta-llama/Llama-3.2-3B-Instruct\",\n",
    "        messages=[\n",
    "            {\"role\": \"system\", \"content\": system_prompt},\n",
    "            {\"role\": \"user\", \"content\": user_prompt}\n",
    "        ],\n",
    "        response_format={\"type\": \"json_object\"},\n",
    "        temperature=0\n",
    "    )\n",
    "    \n",
    "    # Parse the JSON response\n",
    "    try:\n",
    "        scores = json.loads(response.choices[0].message.content.strip())\n",
    "        return scores\n",
    "    except json.JSONDecodeError:\n",
    "        # Fallback if JSON parsing fails\n",
    "        return {\n",
    "            \"accuracy\": 5,\n",
    "            \"clarity\": 5,\n",
    "            \"completeness\": 5,\n",
    "            \"conciseness\": 5\n",
    "        }"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Complete Proposition Processing Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "def process_document_into_propositions(pdf_path, chunk_size=800, chunk_overlap=100, \n",
    "                                      quality_thresholds=None):\n",
    "    \"\"\"\n",
    "    Process a document into quality-checked propositions.\n",
    "    \n",
    "    Args:\n",
    "        pdf_path (str): Path to the PDF file\n",
    "        chunk_size (int): Size of each chunk in characters\n",
    "        chunk_overlap (int): Overlap between chunks in characters\n",
    "        quality_thresholds (Dict): Threshold scores for proposition quality\n",
    "        \n",
    "    Returns:\n",
    "        Tuple[List[Dict], List[Dict]]: Original chunks and proposition chunks\n",
    "    \"\"\"\n",
    "    # Set default quality thresholds if not provided\n",
    "    if quality_thresholds is None:\n",
    "        quality_thresholds = {\n",
    "            \"accuracy\": 7,\n",
    "            \"clarity\": 7,\n",
    "            \"completeness\": 7,\n",
    "            \"conciseness\": 7\n",
    "        }\n",
    "    \n",
    "    # Extract text from the PDF file\n",
    "    text = extract_text_from_pdf(pdf_path)\n",
    "    \n",
    "    # Create chunks from the extracted text\n",
    "    chunks = chunk_text(text, chunk_size, chunk_overlap)\n",
    "    \n",
    "    # Initialize a list to store all propositions\n",
    "    all_propositions = []\n",
    "    \n",
    "    print(\"Generating propositions from chunks...\")\n",
    "    for i, chunk in enumerate(chunks):\n",
    "        print(f\"Processing chunk {i+1}/{len(chunks)}...\")\n",
    "        \n",
    "        # Generate propositions for the current chunk\n",
    "        chunk_propositions = generate_propositions(chunk)\n",
    "        print(f\"Generated {len(chunk_propositions)} propositions\")\n",
    "        \n",
    "        # Process each generated proposition\n",
    "        for prop in chunk_propositions:\n",
    "            proposition_data = {\n",
    "                \"text\": prop,\n",
    "                \"source_chunk_id\": chunk[\"chunk_id\"],\n",
    "                \"source_text\": chunk[\"text\"]\n",
    "            }\n",
    "            all_propositions.append(proposition_data)\n",
    "    \n",
    "    # Evaluate the quality of the generated propositions\n",
    "    print(\"\\nEvaluating proposition quality...\")\n",
    "    quality_propositions = []\n",
    "    \n",
    "    for i, prop in enumerate(all_propositions):\n",
    "        if i % 10 == 0:  # Status update every 10 propositions\n",
    "            print(f\"Evaluating proposition {i+1}/{len(all_propositions)}...\")\n",
    "            \n",
    "        # Evaluate the quality of the current proposition\n",
    "        scores = evaluate_proposition(prop[\"text\"], prop[\"source_text\"])\n",
    "        prop[\"quality_scores\"] = scores\n",
    "        \n",
    "        # Check if the proposition passes the quality thresholds\n",
    "        passes_quality = True\n",
    "        for metric, threshold in quality_thresholds.items():\n",
    "            if scores.get(metric, 0) < threshold:\n",
    "                passes_quality = False\n",
    "                break\n",
    "        \n",
    "        if passes_quality:\n",
    "            quality_propositions.append(prop)\n",
    "        else:\n",
    "            print(f\"Proposition failed quality check: {prop['text'][:50]}...\")\n",
    "    \n",
    "    print(f\"\\nRetained {len(quality_propositions)}/{len(all_propositions)} propositions after quality filtering\")\n",
    "    \n",
    "    return chunks, quality_propositions"
   ]
  },
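  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The threshold check at the heart of the quality filter can be illustrated with hand-made scores (illustrative only; no API call). A proposition is kept only if it meets or exceeds every threshold:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrate the pass/fail threshold logic with hand-made scores\n",
    "thresholds = {\"accuracy\": 7, \"clarity\": 7, \"completeness\": 7, \"conciseness\": 7}\n",
    "examples = [\n",
    "    {\"accuracy\": 9, \"clarity\": 8, \"completeness\": 7, \"conciseness\": 8},  # passes all\n",
    "    {\"accuracy\": 9, \"clarity\": 6, \"completeness\": 8, \"conciseness\": 8},  # fails on clarity\n",
    "]\n",
    "for scores in examples:\n",
    "    passes = all(scores.get(metric, 0) >= threshold for metric, threshold in thresholds.items())\n",
    "    print(scores, \"->\", \"kept\" if passes else \"filtered out\")"
   ]
  },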
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Building Vector Stores for Both Approaches"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def build_vector_stores(chunks, propositions):\n",
    "    \"\"\"\n",
    "    Build vector stores for both chunk-based and proposition-based approaches.\n",
    "    \n",
    "    Args:\n",
    "        chunks (List[Dict]): Original document chunks\n",
    "        propositions (List[Dict]): Quality-filtered propositions\n",
    "        \n",
    "    Returns:\n",
    "        Tuple[SimpleVectorStore, SimpleVectorStore]: Chunk and proposition vector stores\n",
    "    \"\"\"\n",
    "    # Create vector store for chunks\n",
    "    chunk_store = SimpleVectorStore()\n",
    "    \n",
    "    # Extract chunk texts and create embeddings\n",
    "    chunk_texts = [chunk[\"text\"] for chunk in chunks]\n",
    "    print(f\"Creating embeddings for {len(chunk_texts)} chunks...\")\n",
    "    chunk_embeddings = create_embeddings(chunk_texts)\n",
    "    \n",
    "    # Add chunks to vector store with metadata\n",
    "    chunk_metadata = [{\"chunk_id\": chunk[\"chunk_id\"], \"type\": \"chunk\"} for chunk in chunks]\n",
    "    chunk_store.add_items(chunk_texts, chunk_embeddings, chunk_metadata)\n",
    "    \n",
    "    # Create vector store for propositions\n",
    "    prop_store = SimpleVectorStore()\n",
    "    \n",
    "    # Extract proposition texts and create embeddings\n",
    "    prop_texts = [prop[\"text\"] for prop in propositions]\n",
    "    print(f\"Creating embeddings for {len(prop_texts)} propositions...\")\n",
    "    prop_embeddings = create_embeddings(prop_texts)\n",
    "    \n",
    "    # Add propositions to vector store with metadata\n",
    "    prop_metadata = [\n",
    "        {\n",
    "            \"type\": \"proposition\", \n",
    "            \"source_chunk_id\": prop[\"source_chunk_id\"],\n",
    "            \"quality_scores\": prop[\"quality_scores\"]\n",
    "        } \n",
    "        for prop in propositions\n",
    "    ]\n",
    "    prop_store.add_items(prop_texts, prop_embeddings, prop_metadata)\n",
    "    \n",
    "    return chunk_store, prop_store"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Query and Retrieval Functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "def retrieve_from_store(query, vector_store, k=5):\n",
    "    \"\"\"\n",
    "    Retrieve relevant items from a vector store based on query.\n",
    "    \n",
    "    Args:\n",
    "        query (str): User query\n",
    "        vector_store (SimpleVectorStore): Vector store to search\n",
    "        k (int): Number of results to retrieve\n",
    "        \n",
    "    Returns:\n",
    "        List[Dict]: Retrieved items with scores and metadata\n",
    "    \"\"\"\n",
    "    # Create query embedding\n",
    "    query_embedding = create_embeddings(query)\n",
    "    \n",
    "    # Search vector store for the top k most similar items\n",
    "    results = vector_store.similarity_search(query_embedding, k=k)\n",
    "    \n",
    "    return results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def compare_retrieval_approaches(query, chunk_store, prop_store, k=5):\n",
    "    \"\"\"\n",
    "    Compare chunk-based and proposition-based retrieval for a query.\n",
    "    \n",
    "    Args:\n",
    "        query (str): User query\n",
    "        chunk_store (SimpleVectorStore): Chunk-based vector store\n",
    "        prop_store (SimpleVectorStore): Proposition-based vector store\n",
    "        k (int): Number of results to retrieve from each store\n",
    "        \n",
    "    Returns:\n",
    "        Dict: Comparison results\n",
    "    \"\"\"\n",
    "    print(f\"\\n=== Query: {query} ===\")\n",
    "    \n",
    "    # Retrieve results from the proposition-based vector store\n",
    "    print(\"\\nRetrieving with proposition-based approach...\")\n",
    "    prop_results = retrieve_from_store(query, prop_store, k)\n",
    "    \n",
    "    # Retrieve results from the chunk-based vector store\n",
    "    print(\"Retrieving with chunk-based approach...\")\n",
    "    chunk_results = retrieve_from_store(query, chunk_store, k)\n",
    "    \n",
    "    # Display proposition-based results\n",
    "    print(\"\\n=== Proposition-Based Results ===\")\n",
    "    for i, result in enumerate(prop_results):\n",
    "        print(f\"{i+1}) {result['text']} (Score: {result['similarity']:.4f})\")\n",
    "    \n",
    "    # Display chunk-based results\n",
    "    print(\"\\n=== Chunk-Based Results ===\")\n",
    "    for i, result in enumerate(chunk_results):\n",
    "        # Truncate text to keep the output manageable\n",
    "        truncated_text = result['text'][:150] + \"...\" if len(result['text']) > 150 else result['text']\n",
    "        print(f\"{i+1}) {truncated_text} (Score: {result['similarity']:.4f})\")\n",
    "    \n",
    "    # Return the comparison results\n",
    "    return {\n",
    "        \"query\": query,\n",
    "        \"proposition_results\": prop_results,\n",
    "        \"chunk_results\": chunk_results\n",
    "    }"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Response Generation and Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_response(query, results, result_type=\"proposition\"):\n",
    "    \"\"\"\n",
    "    Generate a response based on retrieved results.\n",
    "    \n",
    "    Args:\n",
    "        query (str): User query\n",
    "        results (List[Dict]): Retrieved items\n",
    "        result_type (str): Type of results ('proposition' or 'chunk')\n",
    "        \n",
    "    Returns:\n",
    "        str: Generated response\n",
    "    \"\"\"\n",
    "    # Combine retrieved texts into a single context string\n",
    "    context = \"\\n\\n\".join([result[\"text\"] for result in results])\n",
    "    \n",
    "    # System prompt to instruct the AI on how to generate the response\n",
    "    system_prompt = f\"\"\"You are an AI assistant answering questions based on retrieved information.\n",
    "Your answer should be based on the following {result_type}s that were retrieved from a knowledge base.\n",
    "If the retrieved information doesn't answer the question, acknowledge this limitation.\"\"\"\n",
    "\n",
    "    # User prompt containing the query and the retrieved context\n",
    "    user_prompt = f\"\"\"Query: {query}\n",
    "\n",
    "Retrieved {result_type}s:\n",
    "{context}\n",
    "\n",
    "Please answer the query based on the retrieved information.\"\"\"\n",
    "\n",
    "    # Generate the response using the OpenAI client\n",
    "    response = client.chat.completions.create(\n",
    "        model=\"meta-llama/Llama-3.2-3B-Instruct\",\n",
    "        messages=[\n",
    "            {\"role\": \"system\", \"content\": system_prompt},\n",
    "            {\"role\": \"user\", \"content\": user_prompt}\n",
    "        ],\n",
    "        temperature=0.2\n",
    "    )\n",
    "    \n",
    "    # Return the generated response text\n",
    "    return response.choices[0].message.content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "def evaluate_responses(query, prop_response, chunk_response, reference_answer=None):\n",
    "    \"\"\"\n",
    "    Evaluate and compare responses from both approaches.\n",
    "    \n",
    "    Args:\n",
    "        query (str): User query\n",
    "        prop_response (str): Response from proposition-based approach\n",
    "        chunk_response (str): Response from chunk-based approach\n",
    "        reference_answer (str, optional): Reference answer for comparison\n",
    "        \n",
    "    Returns:\n",
    "        str: Evaluation analysis\n",
    "    \"\"\"\n",
    "    # System prompt to instruct the AI on how to evaluate the responses\n",
    "    system_prompt = \"\"\"You are an expert evaluator of information retrieval systems. \n",
    "    Compare the two responses to the same query, one generated from proposition-based retrieval \n",
    "    and the other from chunk-based retrieval.\n",
    "\n",
    "    Evaluate them based on:\n",
    "    1. Accuracy: Which response provides more factually correct information?\n",
    "    2. Relevance: Which response better addresses the specific query?\n",
    "    3. Conciseness: Which response is more concise while maintaining completeness?\n",
    "    4. Clarity: Which response is easier to understand?\n",
    "\n",
    "    Be specific about the strengths and weaknesses of each approach.\"\"\"\n",
    "\n",
    "    # User prompt containing the query and the responses to be compared\n",
    "    user_prompt = f\"\"\"Query: {query}\n",
    "\n",
    "    Response from Proposition-Based Retrieval:\n",
    "    {prop_response}\n",
    "\n",
    "    Response from Chunk-Based Retrieval:\n",
    "    {chunk_response}\"\"\"\n",
    "\n",
    "    # If a reference answer is provided, include it in the user prompt for factual checking\n",
    "    if reference_answer:\n",
    "        user_prompt += f\"\"\"\n",
    "\n",
    "    Reference Answer (for factual checking):\n",
    "    {reference_answer}\"\"\"\n",
    "\n",
    "    # Add the final instruction to the user prompt\n",
    "    user_prompt += \"\"\"\n",
    "    Please provide a detailed comparison of these two responses, highlighting which approach performed better and why.\"\"\"\n",
    "\n",
    "    # Generate the evaluation analysis using the OpenAI client\n",
    "    response = client.chat.completions.create(\n",
    "        model=\"meta-llama/Llama-3.2-3B-Instruct\",\n",
    "        messages=[\n",
    "            {\"role\": \"system\", \"content\": system_prompt},\n",
    "            {\"role\": \"user\", \"content\": user_prompt}\n",
    "        ],\n",
    "        temperature=0\n",
    "    )\n",
    "    \n",
    "    # Return the generated evaluation analysis\n",
    "    return response.choices[0].message.content"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Complete End-to-End Evaluation Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "def run_proposition_chunking_evaluation(pdf_path, test_queries, reference_answers=None):\n",
    "    \"\"\"\n",
    "    Run a complete evaluation of proposition chunking vs standard chunking.\n",
    "    \n",
    "    Args:\n",
    "        pdf_path (str): Path to the PDF file\n",
    "        test_queries (List[str]): List of test queries\n",
    "        reference_answers (List[str], optional): Reference answers for queries\n",
    "        \n",
    "    Returns:\n",
    "        Dict: Evaluation results\n",
    "    \"\"\"\n",
    "    print(\"=== Starting Proposition Chunking Evaluation ===\\n\")\n",
    "    \n",
    "    # Process document into propositions and chunks\n",
    "    chunks, propositions = process_document_into_propositions(pdf_path)\n",
    "    \n",
    "    # Build vector stores for chunks and propositions\n",
    "    chunk_store, prop_store = build_vector_stores(chunks, propositions)\n",
    "    \n",
    "    # Initialize a list to store results for each query\n",
    "    results = []\n",
    "    \n",
    "    # Run tests for each query\n",
    "    for i, query in enumerate(test_queries):\n",
    "        print(f\"\\n\\n=== Testing Query {i+1}/{len(test_queries)} ===\")\n",
    "        print(f\"Query: {query}\")\n",
    "        \n",
    "        # Get retrieval results from both chunk-based and proposition-based approaches\n",
    "        retrieval_results = compare_retrieval_approaches(query, chunk_store, prop_store)\n",
    "        \n",
    "        # Generate responses based on the retrieved proposition-based results\n",
    "        print(\"\\nGenerating response from proposition-based results...\")\n",
    "        prop_response = generate_response(\n",
    "            query, \n",
    "            retrieval_results[\"proposition_results\"], \n",
    "            \"proposition\"\n",
    "        )\n",
    "        \n",
    "        # Generate responses based on the retrieved chunk-based results\n",
    "        print(\"Generating response from chunk-based results...\")\n",
    "        chunk_response = generate_response(\n",
    "            query, \n",
    "            retrieval_results[\"chunk_results\"], \n",
    "            \"chunk\"\n",
    "        )\n",
    "        \n",
    "        # Get reference answer if available\n",
    "        reference = None\n",
    "        if reference_answers and i < len(reference_answers):\n",
    "            reference = reference_answers[i]\n",
    "        \n",
    "        # Evaluate the generated responses\n",
    "        print(\"\\nEvaluating responses...\")\n",
    "        evaluation = evaluate_responses(query, prop_response, chunk_response, reference)\n",
    "        \n",
    "        # Compile results for the current query\n",
    "        query_result = {\n",
    "            \"query\": query,\n",
    "            \"proposition_results\": retrieval_results[\"proposition_results\"],\n",
    "            \"chunk_results\": retrieval_results[\"chunk_results\"],\n",
    "            \"proposition_response\": prop_response,\n",
    "            \"chunk_response\": chunk_response,\n",
    "            \"reference_answer\": reference,\n",
    "            \"evaluation\": evaluation\n",
    "        }\n",
    "        \n",
    "        # Append the results to the overall results list\n",
    "        results.append(query_result)\n",
    "        \n",
    "        # Print the responses and evaluation for the current query\n",
    "        print(\"\\n=== Proposition-Based Response ===\")\n",
    "        print(prop_response)\n",
    "        \n",
    "        print(\"\\n=== Chunk-Based Response ===\")\n",
    "        print(chunk_response)\n",
    "        \n",
    "        print(\"\\n=== Evaluation ===\")\n",
    "        print(evaluation)\n",
    "    \n",
    "    # Generate overall analysis of the evaluation\n",
    "    print(\"\\n\\n=== Generating Overall Analysis ===\")\n",
    "    overall_analysis = generate_overall_analysis(results)\n",
    "    print(\"\\n\" + overall_analysis)\n",
    "    \n",
    "    # Return the evaluation results, overall analysis, and counts of propositions and chunks\n",
    "    return {\n",
    "        \"results\": results,\n",
    "        \"overall_analysis\": overall_analysis,\n",
    "        \"proposition_count\": len(propositions),\n",
    "        \"chunk_count\": len(chunks)\n",
    "    }"
   ]
  },
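  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a lightweight complement to the LLM-based judge above, a simple lexical-overlap score can give a quick quantitative signal whenever reference answers are available. The `overlap_score` helper below is an illustrative sketch (it is not part of the core pipeline and is not a rigorous metric): it measures the fraction of reference-answer words that also appear in a generated response."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def overlap_score(response, reference):\n",
    "    \"\"\"\n",
    "    Compute a rough lexical-overlap score between a generated response\n",
    "    and a reference answer (illustrative sketch, not a rigorous metric).\n",
    "\n",
    "    Args:\n",
    "        response (str): Generated response text\n",
    "        reference (str): Reference answer text\n",
    "\n",
    "    Returns:\n",
    "        float: Fraction of reference-answer words found in the response\n",
    "    \"\"\"\n",
    "    # Lowercase and tokenize on word characters (re is imported at the top of the notebook)\n",
    "    response_words = set(re.findall(r\"\\w+\", response.lower()))\n",
    "    reference_words = set(re.findall(r\"\\w+\", reference.lower()))\n",
    "    if not reference_words:\n",
    "        return 0.0\n",
    "    # Fraction of the reference vocabulary covered by the response\n",
    "    return len(reference_words & response_words) / len(reference_words)"
   ]
  },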
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_overall_analysis(results):\n",
    "    \"\"\"\n",
    "    Generate an overall analysis of proposition vs chunk approaches.\n",
    "    \n",
    "    Args:\n",
    "        results (List[Dict]): Results from each test query\n",
    "        \n",
    "    Returns:\n",
    "        str: Overall analysis\n",
    "    \"\"\"\n",
    "    # System prompt to instruct the AI on how to generate the overall analysis\n",
    "    system_prompt = \"\"\"You are an expert at evaluating information retrieval systems.\n",
    "    Based on multiple test queries, provide an overall analysis comparing proposition-based retrieval \n",
    "    to chunk-based retrieval for RAG (Retrieval-Augmented Generation) systems.\n",
    "\n",
    "    Focus on:\n",
    "    1. When proposition-based retrieval performs better\n",
    "    2. When chunk-based retrieval performs better\n",
    "    3. The overall strengths and weaknesses of each approach\n",
    "    4. Recommendations for when to use each approach\"\"\"\n",
    "\n",
    "    # Build a short excerpt of each query's evaluation (first 200 characters)\n",
    "    evaluations_summary = \"\"\n",
    "    for i, result in enumerate(results):\n",
    "        evaluations_summary += f\"Query {i+1}: {result['query']}\\n\"\n",
    "        evaluations_summary += f\"Evaluation Excerpt: {result['evaluation'][:200]}...\\n\\n\"\n",
    "\n",
    "    # User prompt containing the summary of evaluations\n",
    "    user_prompt = f\"\"\"Based on the following evaluations of proposition-based vs chunk-based retrieval across {len(results)} queries, \n",
    "    provide an overall analysis comparing these two approaches:\n",
    "\n",
    "    {evaluations_summary}\n",
    "\n",
    "    Please provide a comprehensive analysis of the relative strengths and weaknesses of proposition-based \n",
    "    and chunk-based retrieval for RAG systems.\"\"\"\n",
    "\n",
    "    # Generate the overall analysis using the OpenAI client\n",
    "    response = client.chat.completions.create(\n",
    "        model=\"meta-llama/Llama-3.2-3B-Instruct\",\n",
    "        messages=[\n",
    "            {\"role\": \"system\", \"content\": system_prompt},\n",
    "            {\"role\": \"user\", \"content\": user_prompt}\n",
    "        ],\n",
    "        temperature=0\n",
    "    )\n",
    "    \n",
    "    # Return the generated analysis text\n",
    "    return response.choices[0].message.content"
   ]
  },
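  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Evaluation runs can take a while, so it is worth persisting their output. The helper below is a minimal sketch that serializes the results dictionary returned by `run_proposition_chunking_evaluation` to a JSON file; the `output_path` default is an assumption and can be changed freely."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def save_evaluation_results(evaluation_results, output_path=\"evaluation_results.json\"):\n",
    "    \"\"\"\n",
    "    Persist evaluation results to a JSON file so runs can be reviewed later.\n",
    "\n",
    "    Args:\n",
    "        evaluation_results (Dict): Output of run_proposition_chunking_evaluation\n",
    "        output_path (str): Destination JSON file path (illustrative default)\n",
    "    \"\"\"\n",
    "    with open(output_path, \"w\", encoding=\"utf-8\") as f:\n",
    "        # default=str guards against any non-JSON-serializable values in the results\n",
    "        json.dump(evaluation_results, f, indent=2, default=str)\n",
    "    print(f\"Saved evaluation results to {output_path}\")"
   ]
  },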
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation of Proposition Chunking"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Path to the AI information document that will be processed\n",
    "pdf_path = \"data/AI_Information.pdf\"\n",
    "\n",
    "# Define test queries covering different aspects of AI to evaluate proposition chunking\n",
    "test_queries = [\n",
    "    \"What are the main ethical concerns in AI development?\",\n",
    "    # \"How does explainable AI improve trust in AI systems?\",\n",
    "    # \"What are the key challenges in developing fair AI systems?\",\n",
    "    # \"What role does human oversight play in AI safety?\"\n",
    "]\n",
    "\n",
    "# Reference answers for more thorough evaluation and comparison of results\n",
    "# These provide a ground truth to measure the quality of generated responses\n",
    "reference_answers = [\n",
    "    \"The main ethical concerns in AI development include bias and fairness, privacy, transparency, accountability, safety, and the potential for misuse or harmful applications.\",\n",
    "    # \"Explainable AI improves trust by making AI decision-making processes transparent and understandable to users, helping them verify fairness, identify potential biases, and better understand AI limitations.\",\n",
    "    # \"Key challenges in developing fair AI systems include addressing data bias, ensuring diverse representation in training data, creating transparent algorithms, defining fairness across different contexts, and balancing competing fairness criteria.\",\n",
    "    # \"Human oversight plays a critical role in AI safety by monitoring system behavior, verifying outputs, intervening when necessary, setting ethical boundaries, and ensuring AI systems remain aligned with human values and intentions throughout their operation.\"\n",
    "]\n",
    "\n",
    "# Run the evaluation\n",
    "evaluation_results = run_proposition_chunking_evaluation(\n",
    "    pdf_path=pdf_path,\n",
    "    test_queries=test_queries,\n",
    "    reference_answers=reference_answers\n",
    ")\n",
    "\n",
    "# Print the overall analysis\n",
    "print(\"\\n\\n=== Overall Analysis ===\")\n",
    "print(evaluation_results[\"overall_analysis\"])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv-new-specific-rag",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
